Abstract

Sparse methods for supervised learning aim at finding good linear predictors from as few
variables as possible, i.e., with small cardinality of their supports. This combinatorial selection
problem is often turned into a convex optimization problem by replacing the cardinality
function by its convex envelope (tightest convex lower bound), in this case the $\ell_1$-norm. In this
paper, we investigate more general set-functions than the cardinality, that may incorporate prior
knowledge or structural constraints which are common in many applications: namely, we show
that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained
from its Lovász extension, a common tool in submodular analysis. This defines a family
of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal
operators) and theoretical results (conditions for support recovery or high-dimensional inference).
By selecting specific submodular functions, we can give a new interpretation to known
norms, such as those based on rank-statistics or grouped norms with potentially overlapping
groups; we also define new norms, in particular ones that can be used as non-factorial priors for
supervised learning.



1 Introduction 

The concept of parsimony is central in many scientific domains. In the context of statistics, signal
processing or machine learning, it takes the form of variable or feature selection problems, and is
commonly used in two situations: First, to make the model or the prediction more interpretable or
cheaper to use, i.e., even if the underlying problem does not admit sparse solutions, one looks for the
best sparse approximation. Second, sparsity can also be used given prior knowledge that the model
should be sparse. In these two situations, reducing parsimony to finding models with low cardinality
turns out to be limiting, and structured parsimony has emerged as a fruitful practical extension,
with applications to image processing, text processing or bioinformatics (see, e.g., [1, 2, 3, 4, 5, 6, 7]
and Section 4). For example, in [3], structured sparsity is used to encode prior knowledge regarding
network relationship between genes, while in [6], it is used as an alternative to structured
non-parametric Bayesian process based priors for topic models.

Most of the work based on convex optimization and the design of dedicated sparsity-inducing norms
has focused mainly on the specific allowed set of sparsity patterns [1, 2, 4, 6]: if $w \in \mathbb{R}^p$ denotes the
predictor we aim to estimate, and $\mathrm{Supp}(w)$ denotes its support, then these norms are designed so that
penalizing with these norms only leads to supports from a given family of allowed patterns. In this
paper, we instead follow the approach of [5, 3] and consider specific penalty functions $F(\mathrm{Supp}(w))$
of the support set, which go beyond the cardinality function, but are not limited or designed to only
forbid certain sparsity patterns. As shown in Section 6.2, these may also lead to restricted sets of
supports but their interpretation in terms of an explicit penalty on the support leads to additional
insights into the behavior of structured sparsity-inducing norms (see, e.g., Section 4.1). While direct
greedy approaches (i.e., forward selection) to the problem are considered in [8, 3], we provide convex
relaxations to the function $w \mapsto F(\mathrm{Supp}(w))$, which extend the traditional link between the $\ell_1$-norm
and the cardinality function.

This is done for a particular ensemble of set-functions $F$, namely nondecreasing submodular functions.
Submodular functions may be seen as the set-function equivalent of convex functions, and exhibit
many interesting properties that we review in Section 2; see [9] for a tutorial on submodular analysis
and [10, 11] for other applications to machine learning. This paper makes the following contributions:

— We make explicit links between submodularity and sparsity by showing that the convex envelope
of the function $w \mapsto F(\mathrm{Supp}(w))$ on the $\ell_\infty$-ball may be readily obtained from the Lovász extension
of the submodular function (Section 3).

— We provide generic algorithmic tools, i.e., subgradients and proximal operators (Section 5),
as well as theoretical guarantees, i.e., conditions for support recovery or high-dimensional inference
(Section 6), that extend classical results for the $\ell_1$-norm and show that many norms may be tackled
by the exact same analysis and algorithms.

— By selecting specific submodular functions in Section 4, we recover and give a new interpretation
to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping
groups [1, 2, 7], and we define new norms, in particular ones that can be used as non-factorial priors
for supervised learning (Section 4). These are illustrated on simulation experiments in Section 7,
where they outperform related greedy approaches [3].

Notation. For $w \in \mathbb{R}^p$, $\mathrm{Supp}(w) \subseteq V = \{1, \dots, p\}$ denotes the support of $w$, defined as
$\mathrm{Supp}(w) = \{j \in V, \ w_j \neq 0\}$. For $w \in \mathbb{R}^p$ and $q \in [1, \infty]$, we denote by $\|w\|_q$ the $\ell_q$-norm of $w$.
We denote by $|w| \in \mathbb{R}^p$ the vector of absolute values of the components of $w$. Moreover, given a
vector $w$ and a matrix $Q$, $w_A$ and $Q_{AA}$ are the corresponding subvector and submatrix of $w$ and $Q$.
Finally, for $w \in \mathbb{R}^p$ and $A \subseteq V$, $w(A) = \sum_{k \in A} w_k$ (this defines a modular set-function).

2 Review of submodular function theory 

Throughout this paper, we consider a nondecreasing submodular function $F$ defined on the power
set $2^V$ of $V = \{1, \dots, p\}$, i.e., such that:

$\forall A, B \subseteq V, \quad F(A) + F(B) \geq F(A \cup B) + F(A \cap B)$, (submodularity)
$\forall A, B \subseteq V, \quad A \subseteq B \Rightarrow F(A) \leq F(B)$. (monotonicity)

Moreover, we assume (without loss of generality) that $F(\varnothing) = 0$. These set-functions are often
referred to as polymatroid set-functions [12] or $\beta$-functions [13]. Also, without loss of generality,
we may assume that $F$ is strictly positive on singletons, i.e., for all $k \in V$, $F(\{k\}) > 0$. Indeed, if
$F(\{k\}) = 0$, then by submodularity and monotonicity, if $A \ni k$, $F(A) = F(A \setminus \{k\})$ and thus we can
simply consider $V \setminus \{k\}$ instead of $V$.

Classical examples are the cardinality function (which will lead to the $\ell_1$-norm) and, given a partition
of $V$ into $B_1 \cup \dots \cup B_k = V$, the set-function $A \mapsto F(A)$ which is equal to the number of groups
$B_1, \dots, B_k$ with non-empty intersection with $A$ (which will lead to the grouped $\ell_1/\ell_\infty$-norm [1, 14]).
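
The two defining inequalities can be checked by brute force when $p$ is small. The following Python sketch is only an illustration under stated assumptions: the helper names (`subsets`, `is_submodular`, `is_nondecreasing`), the ground set of size 4 and the two-block partition are hypothetical choices, not part of the paper.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(F, ground, tol=1e-12):
    """Check F(A) + F(B) >= F(A | B) + F(A & B) for all pairs of subsets."""
    all_sets = list(subsets(ground))
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - tol
               for A in all_sets for B in all_sets)

def is_nondecreasing(F, ground, tol=1e-12):
    """Check that A included in B implies F(A) <= F(B)."""
    all_sets = list(subsets(ground))
    return all(F(A) <= F(B) + tol
               for A in all_sets for B in all_sets if A <= B)

V = frozenset(range(4))

# Cardinality function (associated with the l1-norm).
cardinality = lambda A: len(A)

# Number of groups of a (hypothetical) partition of V that intersect A.
partition = [frozenset({0, 1}), frozenset({2, 3})]
group_count = lambda A: sum(1 for B in partition if A & B)

# A counter-example: |A|^2 is nondecreasing but not submodular.
squared = lambda A: len(A) ** 2

print(is_submodular(cardinality, V), is_nondecreasing(cardinality, V))
print(is_submodular(group_count, V), is_nondecreasing(group_count, V))
print(is_submodular(squared, V))
```

The exhaustive check costs on the order of $4^p$ evaluations of $F$, so it is only meant as a sanity test on toy examples.
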
Lovász extension. Given any set-function $F$, one can define its Lovász extension [15] (a.k.a. Choquet
integral [16]) $f : \mathbb{R}_+^p \to \mathbb{R}$, as follows: given $w \in \mathbb{R}_+^p$, we can order the components of $w$ in
decreasing order $w_{j_1} \geq \dots \geq w_{j_p} \geq 0$; the value $f(w)$ is then defined as:

$$f(w) = \sum_{k=1}^p w_{j_k} \left[ F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\}) \right]. \quad (1)$$

Note that if some of the components of $w$ are equal, all orderings lead to the same value of $f(w)$.
The Lovász extension $f$ is always piecewise-linear, and when $F$ is submodular, it is also convex
(see, e.g., [15, 12]). Moreover, for all $\delta \in \{0, 1\}^p$, $f(\delta) = F(\mathrm{Supp}(\delta))$: $f$ is indeed an extension
from vectors in $\{0, 1\}^p$ (which can be identified with indicator vectors of sets) to all vectors in $\mathbb{R}_+^p$.
Moreover, it turns out that minimizing $F$ over subsets, i.e., minimizing $f$ over $\{0, 1\}^p$, is equivalent
to minimizing $f$ over $[0, 1]^p$ [15, 13].
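
As a minimal sketch of Eq. (1), the following Python function sorts the components and accumulates the increments of $F$ along the resulting chain of sets. The function name `lovasz_extension`, the 0-based indexing and the three example set-functions are illustrative assumptions.

```python
def lovasz_extension(F, w):
    """Evaluate Eq. (1) for a nonnegative vector w (list of floats).

    F takes a frozenset of indices in {0, ..., p-1} and returns a float,
    with F(frozenset()) assumed to be 0.
    """
    order = sorted(range(len(w)), key=lambda j: -w[j])  # decreasing order
    value, prefix, prev = 0.0, set(), F(frozenset())
    for j in order:
        prefix.add(j)
        cur = F(frozenset(prefix))
        value += w[j] * (cur - prev)   # w_{j_k} [F({j_1..j_k}) - F({j_1..j_{k-1}})]
        prev = cur
    return value

cardinality = lambda A: len(A)          # extension: sum of the components
capped = lambda A: min(len(A), 1)       # extension: largest component
root = lambda A: len(A) ** 0.5          # a concave function of the cardinality

w = [3.0, 1.0, 2.0]
print(lovasz_extension(cardinality, w))   # sum of the entries of w
print(lovasz_extension(capped, w))        # maximum entry of w

# On indicator vectors, the extension coincides with F(Supp(delta)).
delta = [1.0, 0.0, 1.0]
print(lovasz_extension(root, delta), root(frozenset({0, 2})))
```
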

Submodular polyhedron and greedy algorithm. We denote by $\mathcal{P}$ the submodular polyhedron [12],
defined as the set of $s \in \mathbb{R}_+^p$ such that for all $A \subseteq V$, $s(A) \leq F(A)$, i.e.,
$\mathcal{P} = \{s \in \mathbb{R}_+^p, \ \forall A \subseteq V, \ s(A) \leq F(A)\}$, where we use the notation $s(A) = \sum_{k \in A} s_k$. One important
result in submodular analysis is that if $F$ is a nondecreasing submodular function, then we have a
representation of $f$ as a maximum of linear functions [12, 15], i.e., for all $w \in \mathbb{R}_+^p$,

$$f(w) = \max_{s \in \mathcal{P}} w^\top s. \quad (2)$$

Instead of solving a linear program with $p + 2^p$ constraints, a solution $s$ may be obtained by the
following "greedy algorithm": order the components of $w$ in decreasing order $w_{j_1} \geq \dots \geq w_{j_p}$, and
then take for all $k \in \{1, \dots, p\}$, $s_{j_k} = F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\})$.
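
The greedy algorithm can be sketched as follows in Python, together with a brute-force membership test for the submodular polyhedron. The names `greedy_maximizer` and `in_submodular_polyhedron`, as well as the choice $F(A) = |A|^{1/2}$ and the vectors used, are assumptions made for illustration only.

```python
from itertools import combinations

def greedy_maximizer(F, w):
    """Return s with s_{j_k} = F({j_1..j_k}) - F({j_1..j_{k-1}}) for w sorted decreasingly."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    s, prefix, prev = [0.0] * len(w), set(), F(frozenset())
    for j in order:
        prefix.add(j)
        cur = F(frozenset(prefix))
        s[j] = cur - prev
        prev = cur
    return s

def in_submodular_polyhedron(F, s, tol=1e-9):
    """Brute-force test of s >= 0 and s(A) <= F(A) for every non-empty A."""
    p = len(s)
    if any(x < -tol for x in s):
        return False
    for r in range(1, p + 1):
        for A in combinations(range(p), r):
            if sum(s[k] for k in A) > F(frozenset(A)) + tol:
                return False
    return True

F = lambda A: len(A) ** 0.5      # illustrative nondecreasing submodular function
w = [0.5, 2.0, 1.0]

s = greedy_maximizer(F, w)
value = sum(wk * sk for wk, sk in zip(w, s))   # equals f(w) by Eq. (2)

# Another feasible point, obtained from a different ordering, cannot do better.
s_alt = greedy_maximizer(F, [3.0, 2.0, 1.0])
value_alt = sum(wk * sk for wk, sk in zip(w, s_alt))

print(s, value, value_alt)
```

The sort dominates the cost, so one maximizer is obtained with $p$ evaluations of $F$ after an $O(p \log p)$ ordering step.
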

Stable sets. A set $A$ is said to be stable if it cannot be augmented without increasing $F$, i.e., if for
all sets $B \supseteq A$, $B \neq A \Rightarrow F(B) > F(A)$. If $F$ is strictly increasing, then all sets are stable. Stable sets are
also sometimes referred to as flat or closed [13]. The set of stable sets is closed under intersection [13],
and will correspond to the set of allowed sparsity patterns (see Section 6.2). For the cardinality
function, all sets are stable.

Separable sets. A set $A$ is separable if we can find a partition of $A$ into $A = B_1 \cup \dots \cup B_k$ such
that $F(A) = F(B_1) + \dots + F(B_k)$. A set $A$ is inseparable if it is not separable. As shown in [13],
the submodular polytope $\mathcal{P}$ has full dimension $p$ as soon as $F$ is strictly positive on all singletons,
and its faces are exactly the sets $\{s_k = 0\}$ for $k \in V$ and $\{s(A) = F(A)\}$ for stable and inseparable
sets. We denote by $\mathcal{T}$ the set of such sets. This implies that $\mathcal{P} = \{s \in \mathbb{R}_+^p, \ \forall A \in \mathcal{T}, \ s(A) \leq F(A)\}$.
These stable inseparable sets will play a role when describing extreme points of unit balls of our
new norms (Section 3) and for deriving concentration inequalities in Section 6.3. For the cardinality
function, stable and inseparable sets are singletons.
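
For small ground sets, stable and inseparable sets can be enumerated exhaustively. The Python sketch below is a hedged illustration: the function names are hypothetical, stability is tested by adding one element at a time (which is sufficient when $F$ is nondecreasing), and separability is tested on two-block partitions only (which is sufficient when $F$ is submodular with $F(\varnothing) = 0$, since any finer partition achieving equality also yields a two-block one).

```python
from itertools import combinations

def nonempty_subsets(V):
    """Yield every non-empty subset of V as a frozenset."""
    items = sorted(V)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_stable(F, A, V, tol=1e-12):
    """A is stable if adding any single extra element strictly increases F."""
    return all(F(A | {k}) > F(A) + tol for k in V - A)

def is_separable(F, A, tol=1e-12):
    """A is separable if F(A) = F(B) + F(A minus B) for some proper non-empty B."""
    items = sorted(A)
    for r in range(1, len(items)):
        for combo in combinations(items, r):
            B = frozenset(combo)
            if abs(F(B) + F(A - B) - F(A)) <= tol:
                return True
    return False

def stable_inseparable_sets(F, V):
    """Brute-force enumeration of the family T of stable inseparable sets."""
    return [A for A in nonempty_subsets(V)
            if is_stable(F, A, V) and not is_separable(F, A)]

V = frozenset(range(3))
cardinality = lambda A: len(A)        # T = singletons
capped = lambda A: min(len(A), 1)     # T = {V}
root = lambda A: len(A) ** 0.5        # T = all non-empty subsets

for name, F in [("cardinality", cardinality), ("capped", capped), ("root", root)]:
    print(name, [sorted(A) for A in stable_inseparable_sets(F, V)])
```
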



Submodular function minimization. Submodular functions are particularly interesting because
they can be minimized in polynomial time. In this paragraph, we consider a non-monotonic
submodular function $G$ (otherwise finding the minimum is trivial). Most algorithms for minimizing
submodular functions rely on the following strong duality principle [13, 12]:

$$\min_{A \subseteq V} G(A) = \max_{s \in B(G)} \sum_{k \in V} \min\{0, s_k\}, \quad (3)$$

where $B(G) = \{s \in \mathbb{R}^p, \ \forall A \subseteq V, \ s(A) \leq G(A), \ s(V) = G(V)\}$ is referred to as the base polyhedron.
Moreover, algorithms for minimizing $G$ will usually output $A$ and $s$ such that
$G(A) = \sum_{k \in V} \min\{0, s_k\}$ as a certificate for optimality. The two main types of algorithms are combinatorial
algorithms (that explicitly look for $A$) and ones based on convex optimization (that explicitly
look for $s$). The first type of algorithm leads to strongly polynomial algorithms with best known
complexity $O(p^6)$ [17], while the minimum-norm-point algorithm of [12] has no worst-case complexity
bounds but is usually much faster in practice [12] and is based on the equivalent problem of finding
the minimum-norm point in $B(G)$, i.e., $\min_{s \in B(G)} \|s\|_2^2$. Note that in this case, the minimum-norm-point
algorithm also outputs a particular solution $s$ of Eq. (3), which has several solutions in general.

Figure 1: Polyhedral unit ball, for 4 different submodular functions (two variables), with different
stable inseparable sets leading to different sets of extreme points; changing values of $F$ may make
some of the extreme points disappear. From left to right: $F(A) = |A|^{1/2}$ (all possible extreme
points), $F(A) = |A|$ (leading to the $\ell_1$-norm), $F(A) = \min\{|A|, 1\}$ (leading to the $\ell_\infty$-norm),
$F(A) = \frac{1}{2} 1_{\{A \cap \{2\} \neq \varnothing\}} + 1_{\{A \neq \varnothing\}}$ (leading to the structured norm $\Omega(w) = \frac{1}{2}|w_2| + \|w\|_\infty$).



3 Definition and properties of structured norms 

We define the function $\Omega(w) = f(|w|)$, where $|w|$ is the vector in $\mathbb{R}^p$ composed of absolute values
of $w$ and $f$ the Lovász extension of $F$. We have the following properties (see proof in the appendix),
which show that we indeed define a norm and that it is the desired convex envelope:

Proposition 1 (Convex envelope, dual norm) Assume that the set-function $F$ is submodular,
nondecreasing, and strictly positive for all singletons. Define $\Omega : w \mapsto f(|w|)$. Then:

(i) $\Omega$ is a norm on $\mathbb{R}^p$,

(ii) $\Omega$ is the convex envelope of the function $g : w \mapsto F(\mathrm{Supp}(w))$ on the unit $\ell_\infty$-ball,

(iii) the dual norm (see, e.g., [18]) of $\Omega$ is equal to $\Omega^*(s) = \max_{A \subseteq V} \frac{\|s_A\|_1}{F(A)} = \max_{A \in \mathcal{T}} \frac{\|s_A\|_1}{F(A)}$.
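
As a small numerical illustration of the definition of $\Omega$ and of the dual norm formula in (iii), the Python sketch below evaluates both by direct computation. The function names (`omega`, `dual_norm`) and the test vectors are assumptions; the dual norm is computed by exhaustive enumeration of all non-empty subsets, which is only feasible for small $p$.

```python
from itertools import combinations

def lovasz_extension(F, w):
    """Lovasz extension of F at a nonnegative vector w, following Eq. (1)."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    value, prefix, prev = 0.0, set(), F(frozenset())
    for j in order:
        prefix.add(j)
        cur = F(frozenset(prefix))
        value += w[j] * (cur - prev)
        prev = cur
    return value

def omega(F, w):
    """Omega(w) = f(|w|)."""
    return lovasz_extension(F, [abs(x) for x in w])

def dual_norm(F, s):
    """Brute-force maximum over non-empty A of ||s_A||_1 / F(A)."""
    p = len(s)
    best = 0.0
    for r in range(1, p + 1):
        for A in combinations(range(p), r):
            best = max(best, sum(abs(s[k]) for k in A) / F(frozenset(A)))
    return best

cardinality = lambda A: len(A)        # Omega = l1-norm, dual = l-infinity norm
capped = lambda A: min(len(A), 1)     # Omega = l-infinity norm, dual = l1-norm
root = lambda A: len(A) ** 0.5

w = [1.0, -2.0, 3.0]
s = [0.5, 1.0, -1.0]
print(omega(cardinality, w), dual_norm(cardinality, w))
print(omega(capped, w), dual_norm(capped, w))

# Duality inequality |w^T s| <= Omega(w) Omega^*(s) for a less standard F.
inner = abs(sum(a * b for a, b in zip(w, s)))
print(inner, omega(root, w) * dual_norm(root, s))
```
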

We provide examples of submodular set-functions and norms in Section 4, where we go from
set-functions to norms, and vice-versa. From the definition of the Lovász extension in Eq. (1), we see
that $\Omega$ is a polyhedral norm (i.e., its unit ball is a polyhedron). The following proposition gives the
set of extreme points of the unit ball (see proof in the appendix and examples in Figure 1):

Proposition 2 (Extreme points of unit ball) The extreme points of the unit ball of $\Omega$ are the
vectors $\frac{1}{F(A)} s$, with $s \in \{-1, 0, 1\}^p$, $\mathrm{Supp}(s) = A$ and $A$ a stable inseparable set.

This proposition shows that, depending on the number and cardinality of the inseparable stable sets,
we can go from $2p$ (only singletons) to $3^p - 1$ extreme points (all possible sign vectors). We show
in Figure 1 examples of balls for $p = 2$, as well as sets of extreme points. These extreme points will
play a role in concentration inequalities derived in Section 6.
Figure 2: Sequence and groups: (left) groups for contiguous patterns, (right) groups for penalizing
the number of jumps in the indicator vector sequence.

Figure 3: Regularization path for a penalized least-squares problem, as a function of $\log(\lambda)$ (black:
variables that should be active, red: variables that should be left out). From left to right: $\ell_1$-norm
penalization (a wrong variable is included with the correct ones), polyhedral norm for rectangles in
2D, with zoom (all variables come in together), mix of the two norms (correct behavior).



4 Examples of nondecreasing submodular functions 

We consider three main types of submodular functions with potential applications to regularization
for supervised learning. Some existing norms are shown to be examples of our frameworks
(Section 4.1, Section 4.3), while other novel norms are designed from specific submodular functions
(Section 4.2). Other examples of submodular functions, in particular in terms of matroids and
entropies, may be found in [12, 10, 11] and could also lead to interesting new norms. Note that set
covers, which are common examples of submodular functions, are subcases of set-functions defined
in Section 4.1 (see, e.g., [9]).



4.1 Norms defined with non-overlapping or overlapping groups 

We consider grouped norms defined with potentially overlapping groups [1, 2], i.e.,
$\Omega(w) = \sum_{G \subseteq V} d(G) \|w_G\|_\infty$, where $d$ is a nonnegative set-function (with potentially $d(G) = 0$ when $G$
should not be considered in the norm). It is a norm as soon as $\bigcup_{G, \, d(G) > 0} G = V$ and it corresponds to the
nondecreasing submodular function $F(A) = \sum_{G \cap A \neq \varnothing} d(G)$. In the case where $\ell_\infty$-norms are replaced by
$\ell_2$-norms, [2] has shown that the set of allowed sparsity patterns are intersections of complements of
groups $G$ with strictly positive weights. These sets happen to be the set of stable sets for the corresponding
submodular function; thus the analysis provided in Section 6.2 extends the result of [2] to the new
case of $\ell_\infty$-norms. However, in our situation, we can give a reinterpretation through a submodular
function that counts the number of times the support $A$ intersects groups $G$ with non-zero weights.
This goes beyond restricting the set of allowed sparsity patterns to stable sets. We show later in
this section some insights gained by this reinterpretation. We now give some examples of norms,
with various topologies of groups.
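
The correspondence between the weighted group-cover function and the grouped $\ell_\infty$ norm can be checked numerically. In the Python sketch below, the groups, their weights $d(G)$ and the vector $w$ are hypothetical values chosen for illustration; the two quantities computed should agree up to rounding.

```python
def lovasz_extension(F, w):
    """Lovasz extension of F at a nonnegative vector w, following Eq. (1)."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    value, prefix, prev = 0.0, set(), F(frozenset())
    for j in order:
        prefix.add(j)
        cur = F(frozenset(prefix))
        value += w[j] * (cur - prev)
        prev = cur
    return value

# Hypothetical overlapping groups on V = {0, 1, 2, 3}, with weights d(G) > 0.
# Their union covers V, so the resulting function defines a norm.
groups = {
    frozenset({0, 1}): 1.0,
    frozenset({1, 2, 3}): 0.5,
    frozenset({3}): 2.0,
}

def F_groups(A):
    """F(A) = sum of d(G) over the groups G that intersect A."""
    return sum(d for G, d in groups.items() if G & A)

def grouped_norm(w):
    """Omega(w) = sum over groups of d(G) * ||w_G||_infinity."""
    return sum(d * max(abs(w[k]) for k in G) for G, d in groups.items())

w = [0.3, -1.2, 0.0, 2.5]
via_lovasz = lovasz_extension(F_groups, [abs(x) for x in w])
via_groups = grouped_norm(w)
print(via_lovasz, via_groups)
```
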

Hierarchical norms. Hierarchical norms defined on directed acyclic graphs [1, 5, 6] correspond
to the set-function $F(A)$ which is the cardinality of the union of ancestors of elements in $A$. These
have been applied to bioinformatics [5], computer vision and topic models [6].
Norms defined on grids. If we assume that the $p$ variables are organized in a 1D, 2D or 3D
grid, [2] considers norms based on overlapping groups leading to stable sets equal to rectangular
or convex shapes, with applications in computer vision [19]. For example, for the groups defined
in the left side of Figure 2 (with unit weights), we have $F(A) = p - 2 + \mathrm{range}(A)$ if $A \neq \varnothing$ and
$F(\varnothing) = 0$ (the range of $A$ is equal to $\max(A) - \min(A) + 1$). From empty sets to non-empty sets,
there is a gap of $p - 1$, which is larger than differences among non-empty sets. This leads to the
undesired result, which has been already observed by [2], of adding all variables in one step, rather
than gradually, when the regularization parameter decreases in a regularized optimization problem.
In order to counterbalance this effect, adding a constant times the cardinality function has the effect
of making the first gap relatively smaller. This corresponds to adding a constant times the $\ell_1$-norm
and, as shown in Figure 3, solves the problem of having all variables coming together. All patterns
are then allowed, but contiguous ones are encouraged rather than forced.

Another interesting new norm may be defined from the groups in the right side of Figure 2. Indeed,
it corresponds to the function $F(A)$ equal to $|A|$ plus the number of intervals of $A$. Note that this
also favors contiguous patterns but is not limited to selecting a single interval (like the norm obtained
from groups in the left side of Figure 2). Note that it is to be contrasted with the total variation
(a.k.a. fused Lasso penalty [20]), which is a relaxation of the number of jumps in a vector $w$ rather
than in its support. In 2D or 3D, this extends to the notion of perimeter and area, but we do not
pursue such extensions here.

4.2 Spectral functions of submatrices 

Given a positive semidefinite matrix $Q \in \mathbb{R}^{p \times p}$ and a real-valued function $h$ from $\mathbb{R}_+ \to \mathbb{R}$, one
may define $\mathrm{tr}[h(Q)]$ as $\sum_{i=1}^p h(\lambda_i)$, where $\lambda_1, \dots, \lambda_p$ are the (nonnegative) eigenvalues of $Q$ [21]. We
can thus define the set-function $F(A) = \mathrm{tr}\, h(Q_{AA})$ for $A \subseteq V$. The functions $h(\lambda) = \log(\lambda + t)$ for
$t > 0$ lead to submodular functions, as they correspond to entropies of Gaussian random variables.
Thus, since for $q \in (0, 1)$, $\lambda^q = \frac{q \sin q\pi}{\pi} \int_0^\infty \log(1 + \lambda/t) \, t^{q-1} \, dt$ (see, e.g., [H]),
$h(\lambda) = \lambda^q$ for $q \in (0, 1]$ are positive linear combinations of functions that lead to nondecreasing
submodular functions. Thus, they are also nondecreasing submodular functions, and, to the best of
our knowledge, provide novel examples of such functions.
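
The building block of this argument, namely that $A \mapsto \mathrm{tr} \log(I + Q_{AA}/t)$ is nondecreasing and submodular, can be verified by brute force on a toy matrix. The Python sketch below uses only the standard library; the random design, the value $t = 1$ and all function names are assumptions made for illustration, and the log-determinant is computed through a hand-written Cholesky factorization.

```python
import math
import random
from itertools import combinations

def logdet(M):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            acc = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(acc)
                total += 2.0 * math.log(L[i][j])
            else:
                L[i][j] = acc / L[j][j]
    return total  # equals 0.0 for the empty matrix

def make_F(Q, t):
    """F(A) = tr log(I + Q_AA / t) = log det(I + Q_AA / t)."""
    def F(A):
        idx = sorted(A)
        sub = [[Q[i][j] / t + (1.0 if i == j else 0.0) for j in idx] for i in idx]
        return logdet(sub)
    return F

random.seed(0)
n, p = 6, 4
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Q = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)] for i in range(p)]

F = make_F(Q, t=1.0)
sets = [frozenset(c) for r in range(p + 1) for c in combinations(range(p), r)]
submodular = all(F(A) + F(B) >= F(A | B) + F(A & B) - 1e-9 for A in sets for B in sets)
nondecreasing = all(F(A) <= F(B) + 1e-9 for A in sets for B in sets if A <= B)
print(submodular, nondecreasing)
```
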

In the context of supervised learning from a design matrix $X \in \mathbb{R}^{n \times p}$, we naturally use $Q = X^\top X$.
If $h$ is linear, then $F(A) = \mathrm{tr}\, X_A^\top X_A = \sum_{k \in A} X_k^\top X_k$ (where $X_A$ denotes the submatrix of $X$ with
columns in $A$) and we obtain a weighted cardinality function and hence a weighted $\ell_1$-norm,
which is a factorial prior, i.e., it is a sum of terms depending on each variable independently.

In a frequentist setting, the Mallows $C_L$ penalty [23] depends on the degrees of freedom, of the
form $\mathrm{tr}\, X_A^\top X_A (X_A^\top X_A + \lambda I)^{-1}$. This is a non-factorial prior but unfortunately it does not lead to
a submodular function. In a Bayesian context however, it is shown by [24] that penalties of the
form $\log \det(X_A^\top X_A + \lambda I)$ (which lead to submodular functions) correspond to marginal likelihoods
associated to the set $A$ and have good behavior when used within a non-convex framework. This
highlights the need for non-factorial priors which are sub-linear functions of the eigenvalues of $X_A^\top X_A$,
which is exactly what nondecreasing submodular functions of submatrices are. We do not pursue the
extensive evaluation of non-factorial convex priors in this paper but provide in simulations examples
with $F(A) = \mathrm{tr}(X_A^\top X_A)^{1/2}$ (which is equal to the trace norm of $X_A$ [18]).

4.3 Functions of cardinality 

For $F(A) = h(|A|)$ where $h$ is nondecreasing, such that $h(0) = 0$ and concave, then, from Eq. (1),
$\Omega(w)$ is defined from the rank statistics of $|w| \in \mathbb{R}_+^p$, i.e., if $|w_{(1)}| \geq |w_{(2)}| \geq \dots \geq |w_{(p)}|$, then
$\Omega(w) = \sum_{k=1}^p [h(k) - h(k-1)] \, |w_{(k)}|$. This includes the sum of the $q$ largest elements, and might lead
to interesting new norms for unstructured variable selection but this is not pursued here. However,
the algorithms and analysis presented in Section 5 and Section 6 apply to this case.
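
For functions of the cardinality, the norm only requires sorting the magnitudes. The following Python sketch illustrates this; the function name `cardinality_norm`, the vector `w` and the value `q = 2` are illustrative assumptions.

```python
import math

def cardinality_norm(h, w):
    """Omega(w) = sum_k [h(k) - h(k-1)] |w|_(k), with |w|_(1) >= ... >= |w|_(p)."""
    ranked = sorted((abs(x) for x in w), reverse=True)
    return sum((h(k) - h(k - 1)) * a for k, a in enumerate(ranked, start=1))

w = [0.5, -3.0, 2.0, -1.0]
q = 2

l1 = cardinality_norm(lambda k: k, w)              # h(k) = k: l1-norm
linf = cardinality_norm(lambda k: min(k, 1), w)    # h(k) = min(k, 1): l-infinity norm
top_q = cardinality_norm(lambda k: min(k, q), w)   # sum of the q largest magnitudes
sqrt_norm = cardinality_norm(math.sqrt, w)         # h(k) = sqrt(k)

print(l1, linf, top_q, sqrt_norm)
```

Since $h$ is concave, the increments $h(k) - h(k-1)$ are nonincreasing, so larger magnitudes receive larger weights.
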

5 Convex analysis and optimization 

In this section we provide algorithmic tools related to optimization problems based on the
regularization by our novel sparsity-inducing norms. Note that since these norms are polyhedral norms with
unit balls having potentially an exponential number of vertices or faces, regular linear programming
toolboxes may not be used.

Subgradient. From $\Omega(w) = \max_{s \in \mathcal{P}} s^\top |w|$ and the greedy algorithm¹ presented in Section 2, one
can easily get in polynomial time one subgradient from one of the maximizers $s$ (multiplied
componentwise by the signs of $w$). This allows the use of subgradient descent, with, as shown in
Figure 4, slow convergence compared to proximal methods.
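
A subgradient computation along these lines can be sketched in Python as follows. The names `greedy_maximizer`, `omega` and `subgradient`, the choice $F(A) = |A|^{1/2}$ and the random test points are assumptions; the loop merely checks the subgradient inequality $\Omega(v) \geq \Omega(w) + g^\top (v - w)$ on random vectors $v$.

```python
import random

def greedy_maximizer(F, w):
    """Greedy maximizer of s^T w over the submodular polyhedron, for w >= 0."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    s, prefix, prev = [0.0] * len(w), set(), F(frozenset())
    for j in order:
        prefix.add(j)
        cur = F(frozenset(prefix))
        s[j] = cur - prev
        prev = cur
    return s

def omega(F, w):
    """Omega(w) = max over s in P of s^T |w|, attained by the greedy maximizer."""
    a = [abs(x) for x in w]
    return sum(ak * sk for ak, sk in zip(a, greedy_maximizer(F, a)))

def subgradient(F, w):
    """One subgradient of Omega at w: greedy maximizer for |w|, with the signs of w."""
    s = greedy_maximizer(F, [abs(x) for x in w])
    return [((x > 0) - (x < 0)) * sk for x, sk in zip(w, s)]

F = lambda A: len(A) ** 0.5          # illustrative choice of F
rng = random.Random(0)
w = [rng.gauss(0, 1) for _ in range(5)]
g = subgradient(F, w)

violations = 0
for _ in range(1000):
    v = [rng.gauss(0, 1) for _ in range(5)]
    lower = omega(F, w) + sum(gk * (vk - wk) for gk, vk, wk in zip(g, v, w))
    if omega(F, v) < lower - 1e-9:
        violations += 1
print(violations)
```
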

Proximal operator. Given regularized problems of the form $\min_{w \in \mathbb{R}^p} L(w) + \lambda \Omega(w)$, where
$L$ is differentiable with Lipschitz-continuous gradient, proximal methods have been shown to be
particularly efficient first-order methods (see, e.g., [25]). In this paper, we consider the method
"ISTA" and its accelerated variant "FISTA" [25], which are compared in Figure 4.

To apply these methods, it suffices to be able to solve efficiently problems of the form:
$\min_{w \in \mathbb{R}^p} \frac{1}{2} \|w - z\|_2^2 + \lambda \Omega(w)$. In the case of the $\ell_1$-norm, this reduces to soft thresholding of $z$. The following
proposition (see proof in the appendix) shows that this is equivalent to a particular algorithm
for submodular function minimization, namely the minimum-norm-point algorithm, which has no
complexity bound but is empirically faster than algorithms with such bounds [12]:

Proposition 3 (Proximal operator) Let $z \in \mathbb{R}^p$ and $\lambda > 0$. Minimizing $\frac{1}{2} \|w - z\|_2^2 + \lambda \Omega(w)$
is equivalent to finding the minimum of the submodular function $A \mapsto \lambda F(A) - |z|(A)$ with the
minimum-norm-point algorithm.
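
In the special case of the cardinality function, both sides of this equivalence are available in closed form, which allows a small sanity check. The Python sketch below is not the minimum-norm-point algorithm: it uses soft thresholding for the proximal problem and exhaustive search for the set-function minimization, and compares the support of the former with the minimizer of the latter. The names and the values of `z` and `lam` are illustrative assumptions.

```python
from itertools import combinations

def soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_1 applied to z."""
    return [(abs(x) - lam) * ((x > 0) - (x < 0)) if abs(x) > lam else 0.0 for x in z]

def minimize_set_function(G, p):
    """Exhaustive minimization of G over subsets of {0, ..., p-1} (toy sizes only)."""
    best_set, best_val = frozenset(), G(frozenset())
    for r in range(1, p + 1):
        for combo in combinations(range(p), r):
            A = frozenset(combo)
            val = G(A)
            if val < best_val:
                best_set, best_val = A, val
    return best_set, best_val

z = [0.2, -1.5, 0.9, -0.4]
lam = 0.5

# Proximal problem for F(A) = |A|, i.e., Omega = l1-norm.
w = soft_threshold(z, lam)
support = frozenset(k for k, x in enumerate(w) if x != 0.0)

# Submodular (here modular) function A -> lam * F(A) - |z|(A).
G = lambda A: lam * len(A) - sum(abs(z[k]) for k in A)
A_star, G_min = minimize_set_function(G, len(z))

print(w, sorted(support), sorted(A_star), G_min)
```
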

In the proof, it is shown how a solution for one problem may be obtained from a solution to
the other problem. Moreover, any algorithm for minimizing submodular functions directly yields
the support of the unique solution of the proximal problem, and the full solution may also be
obtained with a sequence of submodular function minimizations. Similar links between
convex optimization and minimization of submodular functions have been considered (see, e.g., [26]).
However, these are dedicated to symmetric submodular functions (such as the ones obtained from
graph cuts) and are thus not directly applicable to our situation of nondecreasing submodular
functions.

Finally, note that using the minimum-norm-point algorithm leads to a generic algorithm that can
be applied to any submodular function $F$, and that it may be rather inefficient for simpler subcases
(e.g., the $\ell_1/\ell_\infty$-norm, tree-structured groups [6], or general overlapping groups [7]).

6 Sparsity-inducing properties 

In this section, we consider a fixed design matrix $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$ a vector of random responses.
Given $\lambda > 0$, we define $\hat{w}$ as a minimizer of the regularized least-squares cost:

$$\min_{w \in \mathbb{R}^p} \frac{1}{2n} \|y - Xw\|_2^2 + \lambda \Omega(w). \quad (4)$$

We study the sparsity-inducing properties of solutions of Eq. (4), i.e., we determine in Section 6.2
which patterns are allowed and in Section 6.3 which sufficient conditions lead to correct estimation.
Like recent analysis of sparsity-inducing norms [27], the analysis provided in this section relies heavily
on decomposability properties of our norm $\Omega$.

¹ The greedy algorithm to find extreme points of the submodular polyhedron should not be confused with the
greedy algorithm (e.g., forward selection) that we consider in Section 7.

6.1 Decomposability 

For a subset $J$ of $V$, we denote by $F_J : 2^J \to \mathbb{R}$ the restriction of $F$ to $J$, defined for $A \subseteq J$
by $F_J(A) = F(A)$, and by $F^J : 2^{J^c} \to \mathbb{R}$ the contraction of $F$ by $J$, defined for $A \subseteq J^c$ by
$F^J(A) = F(A \cup J) - F(J)$. These two functions are submodular and nondecreasing as soon as $F$ is
(see, e.g., [12]).

We denote by $\Omega_J$ the norm on $\mathbb{R}^J$ defined through the submodular function $F_J$, and $\Omega^J$ the
pseudo-norm on $\mathbb{R}^{J^c}$ defined through $F^J$ (as shown in Proposition 4, it is a norm only when $J$ is
a stable set). Note that $\Omega_{J^c}$ (a norm on $J^c$) is in general different from $\Omega^J$. Moreover, $\Omega_J(w_J)$ is
actually equal to $\Omega(\tilde{w})$ where $\tilde{w}_J = w_J$ and $\tilde{w}_{J^c} = 0$, i.e., it is the restriction of $\Omega$ to $J$.

We can now prove the following decomposition properties, which show that under certain
circumstances, we can decompose the norm $\Omega$ on subsets $J$ and their complements:

Proposition 4 (Decomposition) Given $J \subseteq V$ and $\Omega_J$ and $\Omega^J$ defined as above, we have:

(i) $\forall w \in \mathbb{R}^p$, $\Omega(w) \geq \Omega_J(w_J) + \Omega^J(w_{J^c})$,

(ii) $\forall w \in \mathbb{R}^p$, if $\min_{j \in J} |w_j| \geq \max_{j \in J^c} |w_j|$, then $\Omega(w) = \Omega_J(w_J) + \Omega^J(w_{J^c})$,

(iii) $\Omega^J$ is a norm on $\mathbb{R}^{J^c}$ if and only if $J$ is a stable set.
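
Properties (i) and (ii) lend themselves to a quick numerical check. In the Python sketch below, vectors are stored as dictionaries from indices to values so that restrictions to $J$ and $J^c$ keep their original indices; the function names, the choice $F(A) = |A|^{1/2}$, the set $J$ and the test vectors are assumptions made for illustration.

```python
import random

def lovasz_extension(F, w):
    """Lovasz extension for w given as a dict {index: nonnegative value}."""
    order = sorted(w, key=lambda j: -w[j])
    value, prefix, prev = 0.0, frozenset(), F(frozenset())
    for j in order:
        prefix = prefix | {j}
        cur = F(prefix)
        value += w[j] * (cur - prev)
        prev = cur
    return value

def omega(F, w):
    """Norm built from F, applied to a dict-indexed vector."""
    return lovasz_extension(F, {j: abs(x) for j, x in w.items()})

def contraction(F, J):
    """Contraction of F by J: A -> F(A | J) - F(J), for A in the complement of J."""
    return lambda A: F(A | J) - F(J)

def decomposition_gap(F, w, J):
    """Omega(w) - Omega_J(w_J) - Omega^J(w_{J^c}); nonnegative by property (i)."""
    w_J = {j: x for j, x in w.items() if j in J}
    w_Jc = {j: x for j, x in w.items() if j not in J}
    return omega(F, w) - omega(F, w_J) - omega(contraction(F, J), w_Jc)

F = lambda A: len(A) ** 0.5
J = frozenset({0, 1})

random.seed(1)
gaps = [decomposition_gap(F, {j: random.gauss(0, 1) for j in range(5)}, J)
        for _ in range(200)]

# Property (ii): entries on J dominate those on the complement, so the gap vanishes.
w_ordered = {0: 2.0, 1: -3.0, 2: 1.0, 3: -0.5, 4: 0.2}
gap_ordered = decomposition_gap(F, w_ordered, J)
print(min(gaps), gap_ordered)
```
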

6.2 Sparsity patterns 

In this section, we do not make any assumptions regarding the correct specification of the linear
model. We show that with probability one, only stable support sets may be obtained (see proof in
the appendix). For simplicity, we assume invertibility of $X^\top X$, which forbids the high-dimensional
situation $p > n$ that we consider in Section 6.3, but we could consider assumptions similar to the ones
used in [2].

Proposition 5 (Stable sparsity patterns) Assume $y \in \mathbb{R}^n$ has an absolutely continuous density
with respect to the Lebesgue measure and that $X^\top X$ is invertible. Then the minimizer $\hat{w}$ of Eq. (4)
is unique and, with probability one, its support $\mathrm{Supp}(\hat{w})$ is a stable set.

6.3 High-dimensional inference 

We now assume that the linear model is well-specified and extend results from [28] for sufficient
support recovery conditions and from [27] for estimation consistency. As seen in Proposition 4,
the norm $\Omega$ is decomposable and we use this property extensively in this section. We denote by
$\rho(J) = \min_{B \subseteq J^c} \frac{F(B \cup J) - F(J)}{F(B)}$; by submodularity and monotonicity of $F$, $\rho(J)$ is always between
zero and one, and, as soon as $J$ is stable it is strictly positive (for the $\ell_1$-norm, $\rho(J) = 1$). Moreover,
we denote by $c(J) = \sup_{w \in \mathbb{R}^p} \Omega_J(w_J) / \|w_J\|_2$ the equivalence constant between the norm $\Omega_J$ and
the $\ell_2$-norm. We always have $c(J) \leq |J|^{1/2} \max_{k \in V} F(\{k\})$ (with equality for the $\ell_1$-norm).

The following propositions allow us to get back and extend well-known results for the $\ell_1$-norm, i.e.,
Propositions 6 and 8 extend results based on support recovery conditions [28], while Propositions
7 and 8 extend results based on restricted eigenvalue conditions (see, e.g., [H]). We can also get
back results for the $\ell_1/\ell_\infty$-norm [14]. As shown in the appendix, proof techniques are similar and
are adapted through the decomposition properties from Proposition 4.

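Proposition 6 (Support recovery) Assume that $y = Xw^* + \sigma\varepsilon$, where $\varepsilon$ is a standard
multivariate normal vector. Let $Q = \frac{1}{n} X^\top X \in \mathbb{R}^{p \times p}$. Denote by $J$ the smallest stable set containing the
support $\mathrm{Supp}(w^*)$ of $w^*$. Define $\nu = \min_{j, \, w^*_j \neq 0} |w^*_j| > 0$, assume $\kappa = \lambda_{\min}(Q_{JJ}) > 0$ and that for
$\eta > 0$, $(\Omega^J)^* \left[ (\Omega_J(Q_{JJ}^{-1} Q_{Jj}))_{j \in J^c} \right] \leq 1 - \eta$. Then, if $\lambda \leq \frac{\kappa \nu}{2 c(J)}$, the minimizer $\hat{w}$ is unique and has
support equal to $J$, with probability larger than $1 - 3 P\left( \Omega^*(z) > \frac{\lambda \eta \rho(J) \sqrt{n}}{2\sigma} \right)$, where $z$ is a multivariate
normal with covariance matrix $Q$.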

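Proposition 7 (Consistency) Assume that $y = Xw^* + \sigma\varepsilon$, where $\varepsilon$ is a standard multivariate
normal vector. Let $Q = \frac{1}{n} X^\top X \in \mathbb{R}^{p \times p}$. Denote by $J$ the smallest stable set containing the
support $\mathrm{Supp}(w^*)$ of $w^*$. Assume that for all $\Delta$ such that $\Omega^J(\Delta_{J^c}) \leq 3 \Omega_J(\Delta_J)$, $\Delta^\top Q \Delta \geq \kappa \|\Delta_J\|_2^2$.
Then we have $\Omega(\hat{w} - w^*) \leq \frac{24 c(J)^2 \lambda}{\kappa \rho(J)^2}$ and $\frac{1}{n} \|X\hat{w} - Xw^*\|_2^2 \leq \frac{36 c(J)^2 \lambda^2}{\kappa \rho(J)^2}$, with probability larger than
$1 - P\left( \Omega^*(z) > \frac{\lambda \rho(J) \sqrt{n}}{2\sigma} \right)$, where $z$ is a multivariate normal with covariance matrix $Q$.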

Proposition 8 (Concentration inequalities) Let $z$ be a normal variable with covariance matrix
$Q$. Let $\mathcal{T}$ be the set of stable inseparable sets. Then $P(\Omega^*(z) > t) \leq \sum_{A \in \mathcal{T}} 2^{|A|} \exp\left( -\frac{t^2 F(A)^2}{2 \cdot 1^\top Q_{AA} 1} \right)$.



7 Experiments 

We provide illustrations on toy examples of some of the results presented in the paper. We consider
the regularized least-squares problem of Eq. (4), with data generated as follows: given $p, n, k$, the
design matrix $X \in \mathbb{R}^{n \times p}$ is a matrix of i.i.d. Gaussian components, normalized to have unit $\ell_2$-norm
columns. A set $J$ of cardinality $k$ is chosen at random and the weights $w^*_J$ are sampled from a
standard multivariate Gaussian distribution and $w^*_{J^c} = 0$. We then take $y = Xw^* + n^{-1/2} \|Xw^*\|_2 \, \varepsilon$
where $\varepsilon$ is a standard Gaussian vector (this corresponds to a unit signal-to-noise ratio).
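
The data-generating process described above can be sketched with the Python standard library as follows. The function name `generate_data`, the `seed` argument and the toy sizes in the usage line are assumptions; the paper does not specify an implementation.

```python
import math
import random

def generate_data(p, n, k, seed=0):
    """Synthetic regression data with a random support of size k and unit SNR."""
    rng = random.Random(seed)

    # Design matrix with i.i.d. Gaussian entries, columns scaled to unit l2-norm.
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    for j in range(p):
        norm = math.sqrt(sum(X[i][j] ** 2 for i in range(n)))
        for i in range(n):
            X[i][j] /= norm

    # Random support J of cardinality k, standard Gaussian weights on J, zero elsewhere.
    J = sorted(rng.sample(range(p), k))
    w_star = [0.0] * p
    for j in J:
        w_star[j] = rng.gauss(0, 1)

    # y = X w* + n^{-1/2} ||X w*||_2 * eps, with eps standard Gaussian.
    signal = [sum(X[i][j] * w_star[j] for j in J) for i in range(n)]
    scale = math.sqrt(sum(v * v for v in signal) / n)
    y = [signal[i] + scale * rng.gauss(0, 1) for i in range(n)]
    return X, y, w_star, J

X, y, w_star, J = generate_data(p=12, n=8, k=3, seed=0)
print(J, [round(v, 3) for v in y])
```
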

Proximal methods vs. subgradient descent. For the submodular function $F(A) = |A|^{1/2}$
(a simple submodular function beyond the cardinality) we compare three optimization algorithms
described in Section 5: subgradient descent and two proximal methods, ISTA and its accelerated
version FISTA [25], for $p = n = 1000$, $k = 100$ and $\lambda = 0.1$. Other settings and other set-functions
would lead to results similar to the ones presented in Figure 4: FISTA is faster than ISTA, and
much faster than subgradient descent.

Relaxation of combinatorial optimization problem. We compare three strategies for solving
the combinatorial optimization problem $\min_{w \in \mathbb{R}^p} \frac{1}{2n} \|y - Xw\|_2^2 + \lambda F(\mathrm{Supp}(w))$ with
$F(A) = \mathrm{tr}(X_A^\top X_A)^{1/2}$: the approach based on our sparsity-inducing norms, the simpler greedy (forward
selection) approach proposed in [8, 3], and thresholding of the ordinary least-squares estimate. For
all methods, we try all possible regularization parameters. We see in the right plots of Figure 4 that
for hard cases (middle plot) convex optimization techniques perform better than other approaches,
while for easier cases with more observations (right plot), they do as well as greedy approaches.

Non-factorial priors for variable selection. We now focus on the predictive performance and
compare our new norm with $F(A) = \mathrm{tr}(X_A^\top X_A)^{1/2}$ with greedy approaches [3] and with regularization
by $\ell_1$ or $\ell_2$ norms. As shown in Table 1, the new norm based on non-factorial priors is more robust
than the $\ell_1$-norm to lower number of observations $n$ and to larger cardinality of support $k$.
Figure 4: (Left) Comparison of iterative optimization algorithms (value of objective function vs.
running time, in seconds). (Middle/Right) Relaxation of combinatorial optimization problem (curves:
thresholded OLS, greedy, submodular), showing residual error $\frac{1}{n} \|y - X\hat{w}\|_2^2$ vs. penalty
$F(\mathrm{Supp}(\hat{w}))$: (middle) high-dimensional case ($p = 120$, $n = 20$, $k = 40$),
(right) lower-dimensional case ($p = 120$, $n = 120$, $k = 40$).



p 


n 


k 


submodular 


£2 vs. submod. 


£\ vs. submod. 


greedy vs. submod. 


120 


120 


80 


40.8 ± 0.8 


-2.6 ± 0.5 


0.6 ± 0.0 


21.8 ± 0.9 


120 


120 


40 


35.9 ± 0.8 


2.4 ± 0.4 


0.3 ± 0.0 


15.8 ± 1.0 


120 


120 


20 


29.0 ± 1.0 


9.4 ± 0.5 


-0.1 ± 0.0 


6.7 ± 0.9 


120 


120 


10 


20.4 ± 1.0 


17.5 ± 0.5 


-0.2 ± 0.0 


-2.8 ± 0.8 


120 


120 


6 


15.4 ± 0.9 


22.7 ± 0.5 


-0.2 ± 0.0 


-5.3 ± 0.8 


120 


120 


4 


11.7 ± 0.9 


26.3 ± 0.5 


-0.1 ± 0.0 


-6.0 ± 0.8 


120 


20 


80 


46.8 ± 2.1 


-0.6 ± 0.5 


3.0 ± 0.9 


22.9 ± 2.3 


120 


20 


40 


47.9 ± 1.9 


-0.3 ± 0.5 


3.5 ± 0.9 


23.7 ± 2.0 


120 


20 


20 


49.4 ± 2.0 


0.4 ± 0.5 


2.2 ± 0.8 


23.5 ± 2.1 


120 


20 


10 


49.2 ± 2.0 


0.0 ± 0.6 


1.0 ± 0.8 


20.3 ± 2.6 


120 


20 


6 


43.5 ± 2.0 


3.5 ± 0.8 


0.9 ± 0.6 


24.4 ± 3.0 


120 


20 


4 


41.0 ± 2.1 


4.8 ± 0.7 


-1.3 ± 0.5 


25.1 ± 3.5 



Table 1: Normalized mean-square prediction errors $\|X\hat{w} - Xw^*\|_2^2/n$ (multiplied by 100) with optimal 
regularization parameters (averaged over 50 replications, with standard deviations divided by $\sqrt{50}$). 
The performance of the submodular method is shown, then differences from all methods to this 
particular one are computed, and shown in bold when they are significantly greater than zero, as 
measured by a paired t-test with level 5% (i.e., when the submodular method is significantly better). 
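
The significance test described in the caption can be sketched as follows. This is only an illustration under stated assumptions, not the code used to produce the table: the error values are synthetic, the function name `paired_t_statistic` is hypothetical, and the one-sided critical value 1.677 (49 degrees of freedom, level 5%) is an assumption about how the test was run.

```python
import math
import random

def paired_t_statistic(errors_other, errors_submodular):
    """Return (mean difference, standard error, t statistic) for paired samples."""
    diffs = [a - b for a, b in zip(errors_other, errors_submodular)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased sample variance
    std_err = math.sqrt(var) / math.sqrt(n)  # standard deviation divided by sqrt(n)
    return mean, std_err, mean / std_err

# Synthetic errors for 50 replications (illustrative values only).
random.seed(0)
n_rep = 50
submod = [random.gauss(30.0, 5.0) for _ in range(n_rep)]
other = [e + random.gauss(2.0, 3.0) for e in submod]

mean, std_err, t_stat = paired_t_statistic(other, submod)
T_CRIT = 1.677  # assumed one-sided threshold for 49 degrees of freedom at level 5%
print(f"difference: {mean:.1f} ± {std_err:.1f}, t = {t_stat:.2f}, "
      f"significant: {t_stat > T_CRIT}")
```

A positive and significant difference corresponds to an entry that the caption describes as shown in bold.
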
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8 Conclusions 



We have presented a family of sparsity-inducing norms dedicated to incorporating prior knowledge 
or structural constraints on the support of linear predictors. We have provided a set of common 
algorithms and theoretical results, as well as simulations on synthetic examples illustrating the 
good behavior of these norms. Several avenues are worth investigating: first, we could follow current 
practice in sparse methods, e.g., by considering related adapted concave penalties to enhance 
sparsity-inducing norms, or by extending some of the concepts for norms of matrices, with potential 
applications in matrix factorization or multi-task learning (see, e.g., [29] for application of submodular 
functions to dictionary learning). Second, links between submodularity and sparsity could be 
studied further, in particular by considering submodular relaxations of other combinatorial functions, 
or studying links with other polyhedral norms such as the total variation, which are known 
to be similarly associated with symmetric submodular set-functions such as graph cuts [26]. 
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(MGA Project) and the European Research Council (SIERRA Project). The author would like to 
thank Edouard Grave, Rodolphe Jenatton, Armand Joulin, Julien Mairal and Guillaume Obozinski 
for discussions related to this work. 
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A Properties of the norm 



A.1 Proof of Proposition 1 

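(i) $\Omega$ is positively homogeneous by definition of the Lovász extension in Eq. (1), convex because of 
the representation in Eq. (2) as the maximum of $s^\top |w|$ over $s \in \mathcal{P} \subset \mathbb{R}^p_+$, and it is a norm as 
soon as $\Omega(w) = 0$ implies that $w = 0$, which is true since $\Omega(w) \geq \min_{k \in V} F(\{k\}) \, \|w\|_\infty$. 

(ii) We denote by $g^*$ the Fenchel conjugate of $g$ on the domain $\{w \in \mathbb{R}^p, \ \|w\|_\infty \leq 1\}$, and by $g^{**}$ its bidual [18]. By 
definition of the Fenchel conjugate, we have: 

$$\begin{aligned}
g^*(s) &= \max_{\|w\|_\infty \leq 1} w^\top s - g(w) \\
&= \max_{\delta \in \{0,1\}^p} \ \max_{\|w\|_\infty \leq 1} (\delta \circ w)^\top s - f(\delta) \\
&= \max_{\delta \in \{0,1\}^p} \delta^\top |s| - f(\delta) \\
&= \max_{\delta \in [0,1]^p} \delta^\top |s| - f(\delta) \quad \text{because } F - |s| \text{ is submodular.}
\end{aligned}$$

Thus, for all $w$ such that $\|w\|_\infty \leq 1$, 

$$\begin{aligned}
g^{**}(w) &= \max_{s \in \mathbb{R}^p} s^\top w - g^*(s) \\
&= \max_{s \in \mathbb{R}^p} \ \min_{\delta \in [0,1]^p} s^\top w - \delta^\top |s| + f(\delta) \\
&= \min_{\delta \in [0,1]^p} \ \max_{s \in \mathbb{R}^p} s^\top w - \delta^\top |s| + f(\delta) \quad \text{by strong duality and Slater's condition [18]} \\
&= \min_{\delta \in [0,1]^p, \ \delta \geq |w|} f(\delta) = f(|w|) \quad \text{because } F \text{ is nondecreasing.}
\end{aligned}$$

Note that $F$ nondecreasing implies that $f$ is nondecreasing with respect to all of its components. 

(iii) We have 

$$\Omega(w) = f(|w|) = \max_{s \in \mathcal{P}} s^\top |w| = \max_{|s| \in \mathcal{P}} s^\top w = \max_{\forall A \subset V, \ \|s_A\|_1 \leq F(A)} s^\top w = \max_{\max_{A \subset V} \|s_A\|_1 / F(A) \leq 1} s^\top w,$$

which implies the desired result. Note that the maximization may indeed be limited to the stable 
inseparable sets $A \in \mathcal{T}$. 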
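
The primal and dual expressions in this proof can be checked numerically on small examples. The sketch below is illustrative only: the function names `lovasz_norm` and `dual_norm` and the two example set-functions are assumptions introduced here, the set-function is assumed to satisfy $F(\emptyset) = 0$, and the dual norm is computed by brute-force enumeration of all subsets, which is only feasible for small $p$.

```python
import math
from itertools import combinations

def lovasz_norm(w, F):
    """Omega(w) = f(|w|), with f the Lovász extension of F (greedy formula)."""
    order = sorted(range(len(w)), key=lambda j: -abs(w[j]))
    value, prefix, previous = 0.0, [], 0.0  # previous = F(empty set), assumed 0
    for j in order:
        prefix.append(j)
        current = F(frozenset(prefix))
        value += abs(w[j]) * (current - previous)
        previous = current
    return value

def dual_norm(s, F):
    """Omega^*(s) = max over nonempty A of ||s_A||_1 / F(A), by enumeration."""
    p = len(s)
    return max(
        sum(abs(s[j]) for j in A) / F(frozenset(A))
        for size in range(1, p + 1)
        for A in combinations(range(p), size)
    )

cardinality = lambda A: float(len(A))    # yields the l1-norm, with l-infinity dual
sqrt_card = lambda A: math.sqrt(len(A))  # concave function of |A|, hence submodular

w = [3.0, -1.0, 2.0]
s = [0.5, -2.0, 1.0]
print(lovasz_norm(w, cardinality), dual_norm(s, cardinality))

# Inequality s^T w <= Omega(w) * Omega^*(s), which holds for any dual pair of norms.
inner = sum(a * b for a, b in zip(s, w))
print(inner <= lovasz_norm(w, sqrt_card) * dual_norm(s, sqrt_card))
```

With the cardinality function, the two quantities reduce to the $\ell_1$-norm of `w` and the $\ell_\infty$-norm of `s`, which gives a quick sanity check of the formulas.
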



A.2 Proof of Proposition 2 

We have seen in Section 2 that, for $A \in \mathcal{T}$ (the set of stable inseparable sets), $\mathcal{P} \cap \{s(A) = F(A)\}$ is a 
face of $\mathcal{P}$ (and those sets are the only ones for which this happens). We get the desired result by 
considering the different possible signs. 



B Convex optimization results 

We first prove an additional result related to the decomposition of subdifferentials. Note that the exact 
subdifferential for the non-zero components of $w$ is rather complicated when $w$ has components with 
equal magnitude. If this is not the case, i.e., $|w_{j_1}| > \cdots > |w_{j_k}| > 0$, where $k = |J|$, then the 
subdifferential $\partial\Omega_J(w_J)$ is reduced to a point $s$ such that $s_{j_k} = F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\})$. 
For more details on the subdifferential for nonzero components, see [12]. 
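
The single-point case above can be sketched in code as follows. This is a minimal illustration, assuming a nondecreasing set-function with $F(\emptyset) = 0$; the function name `greedy_subgradient` is hypothetical, and the explicit sign factor is an addition made here so that the formula also covers negative components (for positive components it coincides with the expression in the text).

```python
import math

def greedy_subgradient(w, F):
    """Subgradient of Omega at w when all |w_j| are nonzero and pairwise distinct."""
    mags = [abs(x) for x in w]
    if min(mags) == 0.0 or len(set(mags)) != len(mags):
        raise ValueError("requires nonzero components with distinct magnitudes")
    order = sorted(range(len(w)), key=lambda j: -mags[j])
    s, prefix, previous = [0.0] * len(w), [], 0.0  # previous = F(empty set), assumed 0
    for j in order:
        prefix.append(j)
        current = F(frozenset(prefix))
        sign = 1.0 if w[j] > 0 else -1.0  # sign factor for negative components
        s[j] = sign * (current - previous)
        previous = current
    return s

sqrt_card = lambda A: math.sqrt(len(A))  # example submodular function (assumption)
w = [0.5, -2.0, 1.0]
s = greedy_subgradient(w, sqrt_card)
print(s)
# Since Omega is positively homogeneous, s^T w recovers the value Omega(w).
print(sum(a * b for a, b in zip(s, w)))
```
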
Lemma 1 (Decomposition of subdifferential) Let $w \in \mathbb{R}^p$, with support $J = \mathrm{Supp}(w)$ and 
with $H$ equal to the smallest stable set containing $J$. The subdifferential $\partial\Omega(w)$ at $w$ can then be 
decomposed as follows on $\mathbb{R}^p = \mathbb{R}^J \times \mathbb{R}^{H \setminus J} \times \mathbb{R}^{H^c}$: 

$$\partial\Omega(w) = \partial\Omega_J(w_J) \times \{0\} \times \{s_{H^c}, \ (\Omega^H)^*(s_{H^c}) \leq 1\}.$$

Proof For all sufficiently small $\Delta \in \mathbb{R}^p$, the components of $(w + \Delta)_J$ all have greater absolute values 
than the ones of $(w + \Delta)_{J^c}$. Thus, from Proposition 4, $\Omega(w + \Delta) = \Omega_J(w_J + \Delta_J) + \Omega^J(\Delta_{J^c}) = 
\Omega_J(w_J + \Delta_J) + \Omega^H(\Delta_{H^c})$, and thus the subdifferential decomposes as $\partial\Omega_J(w_J) \times \{0\} \times \partial\Omega^H(0)$. 
The subdifferential of a norm at zero is exactly the unit ball of the dual norm, which leads to the 
desired result. ■ 



B.1 Proof of Proposition 3 

Following [8], without loss of generality, we assume that $z$ has nonnegative components. We have, 
by convex duality (which is applicable here because of Slater's condition): 

$$\begin{aligned}
\min_{w \in \mathbb{R}^p} \frac{1}{2}\|w - z\|_2^2 + \lambda\Omega(w) &= \min_{w \in \mathbb{R}^p} \ \max_{\Omega^*(s) \leq 1} \frac{1}{2}\|w - z\|_2^2 + \lambda s^\top w \\
&= \max_{\Omega^*(s) \leq 1} \ \min_{w \in \mathbb{R}^p} \frac{1}{2}\|w - z\|_2^2 + \lambda s^\top w \\
&= \max_{\Omega^*(s) \leq 1} \frac{1}{2}\|z\|_2^2 - \frac{1}{2}\|\lambda s - z\|_2^2,
\end{aligned}$$

where the (unique) optimal $w$ is obtained from the optimal $s$ by $w = z - \lambda s$. The vector $s$ is constrained 
to satisfy $\Omega^*(s) \leq 1$, which is equivalent to $|s| \in \mathcal{P}$. Since $z$ has nonnegative components, the 
minimum restricted to $|s| \in \mathcal{P}$ is the same as the minimum restricted to $s \in \mathcal{P}$, and also the same as 
the one restricted to the submodular polyhedron without constraints on positivity, i.e., our problem 
reduces to $\min_{\forall A \subset V, \ s(A) \leq F(A)} \|s - \lambda^{-1} z\|_2^2$, which is also equivalent to 

$$\min_{\forall A \subset V, \ t(A) \leq F(A) - \lambda^{-1} z(A)} \|t\|_2^2. \tag{5}$$

Up to the constraint $t(V) = F(V) - \lambda^{-1} z(V)$, this is the minimum-norm point problem for the 
submodular function $G : A \mapsto F(A) - \lambda^{-1} z(A)$. We can then follow two approaches. The first one 
is to apply the minimum-norm point algorithm directly to the problem in Eq. (5), which is what we have 
done in simulations. The second approach is to consider the regular minimum-norm point algorithm; we 
can then follow [12, Lemma 7.4]: if $t$ is the minimum-norm solution for the submodular function $G$, 
then we can obtain $s$ as $\lambda^{-1} z$ plus the negative part of $t$. From $s$ we then get $w$ through $w = z - \lambda s$. 

If another algorithm is used for submodular function minimization, then, following [12, Lemma 7.4], 
we know which components of the (unique) optimal value $t^*$ are negative and which of them are 
equal to zero (which corresponds to zero components of $w^*$). Then, following [26], if we add a 
constant vector with components equal to $\alpha$ to $z$, we may obtain level sets of $w^*$. With several 
values of $\alpha$, we can then obtain the full solution $w^*$. However, the minimum-norm point algorithm 
remains the most efficient and directly yields the solution of the proximal problem. 
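
The primal-dual relation $w = z - \lambda s$ can be illustrated in the simplest special case $F(A) = |A|$, where $\Omega$ is the $\ell_1$-norm, $\Omega^*$ is the $\ell_\infty$-norm, and the proximal operator is soft-thresholding. The sketch below covers only this special case and does not implement the minimum-norm point algorithm; the function names `prox_l1` and `dual_from_primal` are hypothetical.

```python
import math

def prox_l1(z, lam):
    """Soft-thresholding: minimizer of 0.5 * ||w - z||_2^2 + lam * ||w||_1."""
    return [math.copysign(max(abs(zj) - lam, 0.0), zj) for zj in z]

def dual_from_primal(z, w, lam):
    """Recover the dual variable s from the relation w = z - lam * s."""
    return [(zj - wj) / lam for zj, wj in zip(z, w)]

z, lam = [3.0, -0.5, 1.2, -2.0], 1.0
w = prox_l1(z, lam)
s = dual_from_primal(z, w, lam)
print("w =", w)
print("s =", s)

# Dual feasibility: Omega^*(s) = ||s||_inf <= 1 in this special case.
print("dual feasible:", max(abs(sj) for sj in s) <= 1.0 + 1e-12)
# At the optimum, s attains the maximum in Omega(w) = max over the dual ball of s^T w.
gap = sum(a * b for a, b in zip(s, w)) - sum(abs(wj) for wj in w)
print("s^T w - ||w||_1 =", gap)
```

For a general submodular function the dual ball is no longer a box, which is why the proof reduces the problem to a minimum-norm point computation instead of a coordinate-wise rule.
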
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C Sparse estimation 



In this section, we consider a fixed design matrix $X \in \mathbb{R}^{n \times p}$ and a vector $y \in \mathbb{R}^n$ of random 
responses. Given $\lambda > 0$, we define $\hat{w}$ as a minimizer of the regularized least-squares cost: 

$$\min_{w \in \mathbb{R}^p} \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\Omega(w). \tag{6}$$

C.1 Proof of Proposition 4 

(i) For $s \in \mathbb{R}^p_+$, if $\forall B \subset J$, $s(B) \leq F(B)$ and $\forall C \subset J^c$, $s(C) \leq F(C \cup J) - F(J)$, then $\forall A \subset V$, 
$s(A) = s(A \cap J) + s(A \cap J^c) \leq F(A \cap J) + F(A \cup J) - F(J) \leq F(A)$ by submodularity. This 
implies the desired result by considering the representation of the Lovász extension in Eq. (2) 
and the fact that we have just proved that $\mathcal{P}$ contains the product of the two submodular polyhedra 
associated with $F_J$ and $F^J$. 

(ii) This is immediate from the expression of the Lovász extension in Eq. (1). Indeed, the order 
within $J$ and the one within $J^c$ do not interact. Note that this case includes cases where some 
of the components of $|w_J|$ are equal to some of those of $|w_{J^c}|$. 

(iii) $\Omega^J$ corresponds to the submodular function obtained as the contraction of $F$ by $J$. It is thus a 
norm as soon as $F^J$ is positive on all singletons, which is itself equivalent to the stability of $J$. The 
equivalence between being a norm and the stability of the set $J$ is then straightforward. 
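
Parts (i) and (ii) can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the set-function $F(A) = \sqrt{|A|}$ and the index set `J` are arbitrary choices, the helper names are hypothetical, and the contraction is taken as $F^J(A) = F(A \cup J) - F(J)$, matching the bound used in part (i).

```python
import math
import random

def lovasz_extension(mags, F):
    """Greedy formula; mags maps each index to a nonnegative value."""
    value, prefix, previous = 0.0, set(), 0.0
    for j in sorted(mags, key=lambda j: -mags[j]):
        prefix.add(j)
        current = F(frozenset(prefix))
        value += mags[j] * (current - previous)
        previous = current
    return value

def decomposition_terms(w, J, F):
    """Return (Omega(w), Omega_J(w_J), Omega^J(w_{J^c}))."""
    J = frozenset(J)
    full = {j: abs(x) for j, x in enumerate(w)}
    on_J = {j: m for j, m in full.items() if j in J}
    off_J = {j: m for j, m in full.items() if j not in J}
    contraction = lambda A: F(A | J) - F(J)  # F^J, the contraction of F by J
    return (lovasz_extension(full, F),
            lovasz_extension(on_J, F),
            lovasz_extension(off_J, contraction))

F = lambda A: math.sqrt(len(A))  # example nondecreasing submodular function
J = [0, 2]

# (i) Omega(w) >= Omega_J(w_J) + Omega^J(w_{J^c}) on random vectors.
random.seed(0)
for _ in range(200):
    w = [random.gauss(0.0, 1.0) for _ in range(5)]
    total, restricted, contracted = decomposition_terms(w, J, F)
    assert total >= restricted + contracted - 1e-9

# (ii) equality when the magnitudes on J dominate those on the complement of J.
total, restricted, contracted = decomposition_terms([3.0, 0.5, -2.0, 1.0, 0.1], J, F)
print(total, restricted + contracted)
```
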

C.2 Proof of Proposition 5 

Let $Q = \frac{1}{n} X^\top X \in \mathbb{R}^{p \times p}$ and $r = \frac{1}{n} X^\top y \in \mathbb{R}^p$. The uniqueness of the minimizer $\hat{w}$ is a consequence 
of the invertibility of $Q = \frac{1}{n} X^\top X$. Let $J \subset V$. We will show that if $\mathrm{Supp}(\hat{w}) = J$, then $\hat{w}_J$ is an 
affine function of $r$ (and hence of $y$), of the form $\hat{w}_J = (Q_{JJ}^{-1} - A_{JJ}) r_J + b_J$, where $0 \preccurlyeq A_{JJ} \preccurlyeq Q_{JJ}^{-1}$ 
and $(A_{JJ}, b_J)$ belongs to a finite set independent of $r$. 

If $J$ is not a stable set, then, by Lemma 1, this implies that there exists $j \in J^c$ such that 
$Q_{jJ}[(Q_{JJ}^{-1} - A_{JJ}) r_J + b_J] - r_j = 0$, i.e., 

$$0 = Q_{jJ}[(Q_{JJ}^{-1} - A_{JJ}) r_J + b_J] - r_j = \frac{1}{n}\big[Q_{jJ} Q_{JJ}^{-1} X_J^\top - X_j^\top - Q_{jJ} A_{JJ} X_J^\top\big] y + Q_{jJ} b_J.$$

The row vector $Q_{jJ} Q_{JJ}^{-1} X_J^\top - X_j^\top - Q_{jJ} A_{JJ} X_J^\top$ cannot be equal to zero; otherwise, 

$$0 = \frac{1}{n}\big[Q_{jJ} Q_{JJ}^{-1} X_J^\top - X_j^\top - Q_{jJ} A_{JJ} X_J^\top\big] X_j = Q_{jJ} Q_{JJ}^{-1} Q_{Jj} - Q_{jj} - Q_{jJ} A_{JJ} Q_{Jj} \leq Q_{jJ} Q_{JJ}^{-1} Q_{Jj} - Q_{jj},$$

which is a contradiction because of the invertibility of $Q$ and the Schur complement lemma [30] 
(which implies that the previous quantity must be strictly negative). Thus, we have shown that if 
$\mathrm{Supp}(\hat{w}) = J$ and $J$ is not a stable subset, then $c^\top y = d$ for one of a finite number of nonzero 
pairs $(c, d) \in \mathbb{R}^n \times \mathbb{R}$. This occurs with probability zero. 

What remains to be shown is the affine representation of $\hat{w}_J$ when the support is given; it is essentially 
equivalent to showing that the path is piecewise affine, which is not surprising for a polyhedral 
norm [31]. We use the representation $\Omega_J(w_J) = \max_{z \in B} z^\top w_J$, where $B$ is the finite set of $z$ such 
that $|z|$ is an extreme point of the submodular polyhedron associated with $\Omega_J$. 

Necessary optimality conditions [32] for the problem in Eq. (6) are the existence of $\eta_z \geq 0$ (for 
each $z \in B$) such that (1) $\sum_{z \in B} \eta_z = 1$, (2) $\eta_z = 0$ if $z$ is not a maximizer of $\max_{z \in B} z^\top w_J$, and 
(3) $w_J$ is a minimizer of $\frac{1}{2} w_J^\top Q_{JJ} w_J - r_J^\top w_J + \lambda w_J^\top \sum_{z \in B} \eta_z z$, i.e., $Q_{JJ} w_J + \lambda \sum_{z \in B} \eta_z z = r_J$. 
Moreover, by Carathéodory's theorem [32], the number $k$ of non-zero $\eta_z$ may be taken to be less than 
$|J| + 1$. 

This implies that, if we consider the vector $\zeta \in \mathbb{R}^k$ of non-zero $\eta_z$ and the matrix $Z \in \mathbb{R}^{|J| \times k}$ of the 
corresponding $z$'s, then we have 

$$Q_{JJ} w_J + \lambda Z \zeta = r_J, \qquad \zeta^\top 1 = 1, \qquad \exists c \in \mathbb{R} \text{ such that } Z^\top w_J = c 1.$$

In matrix form, this can be written as: 

$$\begin{pmatrix} Q_{JJ} & \lambda Z & 0 \\ \lambda Z^\top & 0 & -\lambda 1 \\ 0 & -\lambda 1^\top & 0 \end{pmatrix} \begin{pmatrix} w_J \\ \zeta \\ c \end{pmatrix} = \begin{pmatrix} r_J \\ 0 \\ -\lambda \end{pmatrix}.$$

It is then a simple linear algebra exercise to show that if $k < |J| + 1$, then $w_J$ is of the desired form. 
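
The strict negativity invoked through the Schur complement lemma can be checked numerically. The sketch below is only an illustration: the random Gaussian design, the dimensions, and the helper names `solve` and `schur_gap` are assumptions introduced here, and the linear system is solved with a small hand-written Gaussian elimination to keep the example self-contained.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def schur_gap(Q, J, j):
    """Return Q_jJ Q_JJ^{-1} Q_Jj - Q_jj for an index list J and j not in J."""
    Q_JJ = [[Q[a][b] for b in J] for a in J]
    Q_Jj = [Q[a][j] for a in J]
    x = solve(Q_JJ, Q_Jj)  # x = Q_JJ^{-1} Q_Jj
    return sum(Q[j][a] * xa for a, xa in zip(J, x)) - Q[j][j]

# Q = X^T X / n for a random design with n > p (invertible with probability one).
random.seed(0)
n, p = 30, 5
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
Q = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(p)]
     for a in range(p)]
print(schur_gap(Q, J=[0, 1, 3], j=4))  # negative when Q is positive definite
```
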
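
C.3 Proof of Proposition 6 

Let $q = \frac{\sigma}{n} X^\top \varepsilon \in \mathbb{R}^p$, which is normal with mean zero and covariance matrix $\sigma^2 Q / n$. We have 
$\Omega(x) \geq \Omega_J(x_J) + \Omega^J(x_{J^c}) \geq \Omega_J(x_J) + \rho(J) \Omega_{J^c}(x_{J^c}) \geq \rho(J) \Omega(x)$. This implies that $\Omega^*(q) \geq 
\rho(J) \max\{\Omega_J^*(q_J), (\Omega^J)^*(q_{J^c})\}$. Moreover, $q_{J^c} - Q_{J^c J} Q_{JJ}^{-1} q_J$ is normal with covariance matrix 
$\frac{\sigma^2}{n}(Q_{J^c J^c} - Q_{J^c J} Q_{JJ}^{-1} Q_{J J^c}) \preccurlyeq \frac{\sigma^2}{n} Q_{J^c J^c}$. This implies that, with probability larger than 
$1 - 3 P(\Omega^*(q) > \lambda \rho(J) \eta / 2)$, we have $\Omega_J^*(q_J) \leq \lambda / 2$ and $(\Omega^J)^*(q_{J^c} - Q_{J^c J} Q_{JJ}^{-1} q_J) \leq \lambda \eta / 2$. 

We denote by $\tilde{w}$ the unique (because $Q_{JJ}$ is invertible) minimum of $\frac{1}{2n}\|y - Xw\|_2^2 + \lambda\Omega(w)$, subject 
to $w_{J^c} = 0$. $\tilde{w}_J$ is defined through $Q_{JJ}(\tilde{w}_J - w^*_J) - q_J = -\lambda s_J$, where $s_J \in \partial\Omega_J(\tilde{w}_J)$ (which implies 
that $\Omega_J^*(s_J) \leq 1$), i.e., $\tilde{w}_J - w^*_J = Q_{JJ}^{-1}(q_J - \lambda s_J)$. We have: 

$$\begin{aligned}
\|\tilde{w}_J - w^*_J\|_\infty &\leq \max_{j \in J} |\delta_j^\top Q_{JJ}^{-1}(q_J - \lambda s_J)| \\
&\leq \max_{j \in J} c(J) \|Q_{JJ}^{-1} \delta_j\|_2 \, [\Omega_J^*(q_J) + \lambda \Omega_J^*(s_J)] \leq \frac{3}{2} \lambda c(J) \kappa^{-1}.
\end{aligned}$$

Thus if $2 \lambda c(J) \kappa^{-1} \leq \nu$, then $\mathrm{Supp}(\tilde{w}) \supset \mathrm{Supp}(w^*)$. 

We now show that, since we have $(\Omega^J)^*(q_{J^c} - Q_{J^c J} Q_{JJ}^{-1} q_J) \leq \lambda \eta / 2$, $\tilde{w}$ is the unique minimizer of 
Eq. (6). For that it suffices to show that $(\Omega^J)^*(Q_{J^c J}(\tilde{w}_J - w^*_J) - q_{J^c}) < \lambda$. We have: 

$$\begin{aligned}
(\Omega^J)^*(Q_{J^c J}(\tilde{w}_J - w^*_J) - q_{J^c}) &= (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1}(q_J - \lambda s_J) - q_{J^c}) \\
&\leq (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1} q_J - q_{J^c}) + \lambda (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1} s_J) \\
&\leq (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1} q_J - q_{J^c}) + \lambda (\Omega^J)^*\big[(\Omega_J(Q_{JJ}^{-1} Q_{Jj}))_{j \in J^c}\big] \\
&\leq \lambda \eta / 2 + \lambda (1 - \eta) < \lambda,
\end{aligned}$$

which leads to the desired result. 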
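
C.4 Proof of Proposition 7 

As in the proof of Proposition 6, we have $\Omega(x) \geq \Omega_J(x_J) + \Omega^J(x_{J^c}) \geq \Omega_J(x_J) + \rho(J) \Omega_{J^c}(x_{J^c}) \geq 
\rho(J) \Omega(x)$. Thus, if we assume $\Omega^*(q) \leq \lambda \rho(J) / 2$, then $\Omega_J^*(q_J) \leq \lambda / 2$ and $(\Omega^J)^*(q_{J^c}) \leq \lambda / 2$. Let 
$\Delta = \hat{w} - w^*$. 

We follow the proof from [33] by using the decomposition property of the norm $\Omega$. We have, by 
optimality of $\hat{w}$: 

$$\lambda \Omega(w^* + \Delta) + q^\top \Delta \leq \frac{1}{2} \Delta^\top Q \Delta + \lambda \Omega(w^* + \Delta) + q^\top \Delta \leq \lambda \Omega(w^*).$$

Using the decomposition property, 

$$\begin{aligned}
\lambda \Omega_J((w^* + \Delta)_J) + \lambda \Omega^J((w^* + \Delta)_{J^c}) + q_J^\top \Delta_J + q_{J^c}^\top \Delta_{J^c} &\leq \lambda \Omega_J(w^*_J) \\
\lambda \Omega^J(\Delta_{J^c}) &\leq \lambda \Omega_J(w^*_J) - \lambda \Omega_J(w^*_J + \Delta_J) + \Omega_J^*(q_J) \Omega_J(\Delta_J) + (\Omega^J)^*(q_{J^c}) \Omega^J(\Delta_{J^c}) \\
(\lambda - (\Omega^J)^*(q_{J^c})) \, \Omega^J(\Delta_{J^c}) &\leq (\lambda + \Omega_J^*(q_J)) \, \Omega_J(\Delta_J).
\end{aligned}$$

Thus $\Omega^J(\Delta_{J^c}) \leq 3 \Omega_J(\Delta_J)$, which implies $\Delta^\top Q \Delta \geq \kappa \|\Delta_J\|_2^2$ (we have assumed a restricted eigenvalue 
condition). Moreover, we have: 

$$\begin{aligned}
\Delta^\top Q \Delta = \Delta^\top (Q \Delta) &\leq \Omega(\Delta) \, \Omega^*(Q \Delta) \\
&\leq \Omega(\Delta)(\Omega^*(q) + \lambda) \leq \frac{3\lambda}{2} \Omega(\Delta) \quad \text{by optimality of } \hat{w}, \\
\Omega(\Delta) &\leq \Omega_J(\Delta_J) + \rho(J)^{-1} \Omega^J(\Delta_{J^c}).
\end{aligned}$$

Together with $\Omega^J(\Delta_{J^c}) \leq 3 \Omega_J(\Delta_J)$ and $\rho(J) \leq 1$, the last inequality gives $\Omega(\Delta) \leq \frac{4}{\rho(J)} \Omega_J(\Delta_J)$. 
This implies that $\frac{\kappa}{c(J)^2} \Omega_J(\Delta_J)^2 \leq \kappa \|\Delta_J\|_2^2 \leq \Delta^\top Q \Delta \leq \frac{6\lambda}{\rho(J)} \Omega_J(\Delta_J)$, and thus $\Omega_J(\Delta_J) \leq \frac{6 \lambda c(J)^2}{\kappa \rho(J)}$, 
which leads to the desired result, given the previous inequalities. 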

C.5 Proof of Proposition 8 

We have $\Omega^*(z) = \max_{\Omega(w) \leq 1} w^\top z$; the maximum can be taken over the set of extreme points of the 
unit ball, which leads to the desired result given Proposition 2. 
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